Performance Impact of Data Locality in MapReduce on Hadoop

Author

  • Ju-Yeon Jo
Abstract

As the foundation for MapReduce processing, Hadoop is one of the fundamental technologies in big data analytics. Hadoop breaks up large data into data blocks, replicates them, and stores them in a distributed storage system. A data block can reside on the machine where it will be processed (data-local), on another machine in the same rack (rack-local), or on a machine in a different rack (off-rack). The farther a data block is from the processing node, the higher the data transfer overhead. The location of a data block can therefore significantly influence the performance of MapReduce processing. The data locality problem has been approached in many ways, including scheduling, data placement, networking, partition/key, and framework. While the majority of the data locality improvement effort is concentrated in the early stages of MapReduce, it is possible to extend it to the later stages. These two approaches are called Shallow Data Locality (SDL) and Deep Data Locality (DDL), respectively. DDL can be achieved in two ways: pre-arranging the data blocks to reduce data movement, and/or micromanipulating the data within the data blocks. It has been shown that DDL can improve the performance of MapReduce in certain applications. This talk will introduce the concept of Hadoop data locality, its impact on MapReduce performance, and past approaches to improving data locality. The DDL process, the analytical models and simulation results, and the experimental results obtained with common benchmarking tools will also be presented.
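The three locality levels named in the abstract can be illustrated with a minimal sketch. This is not Hadoop's scheduler code: the function `classify_locality`, the `Locality` enum, and the node and rack names are all hypothetical, chosen only to show how a task's node relates to the nodes holding a block's replicas.

```python
from enum import IntEnum


class Locality(IntEnum):
    """Locality levels from the abstract; a lower value means less data transfer."""
    DATA_LOCAL = 0   # a replica is on the node that runs the task
    RACK_LOCAL = 1   # a replica is on another node in the same rack
    OFF_RACK = 2     # every replica is in a different rack


def classify_locality(task_node, replica_nodes, rack_of):
    """Classify a task's locality (hypothetical helper, not a Hadoop API).

    task_node:     name of the node where the task would run
    replica_nodes: names of the nodes holding replicas of the input block
    rack_of:       mapping from node name to rack name (assumed topology)
    """
    if task_node in replica_nodes:
        return Locality.DATA_LOCAL
    task_rack = rack_of[task_node]
    if any(rack_of[node] == task_rack for node in replica_nodes):
        return Locality.RACK_LOCAL
    return Locality.OFF_RACK


# Assumed two-rack topology, for illustration only.
rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}

print(classify_locality("n1", ["n1", "n3"], rack_of).name)
print(classify_locality("n2", ["n1"], rack_of).name)
print(classify_locality("n3", ["n1", "n2"], rack_of).name)
```

A locality-aware scheduler would, under these assumptions, prefer the candidate node with the lowest `Locality` value for a given block.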

Similar resources

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...

Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing

Using different scheduling algorithms can affect the performance of mobile cloud computing using the Hadoop MapReduce framework. In the Hadoop MapReduce framework, the default scheduling algorithm is First-In-First-Out (FIFO). However, the FIFO scheduler simply schedules tasks according to their arrival time and does not consider any other factors that may have a great impact on system performance. As a res...

Improving MapReduce Performance by Data Prefetching in Heterogeneous or Shared Environments

MapReduce is an effective programming model for large-scale data-intensive computing applications. Hadoop, an open-source implementation of MapReduce, has been widely used. The communication overhead from the big data sets’ transmission affects the performance of Hadoop greatly. In consideration of data locality, Hadoop schedules tasks to the nodes near the data locations preferentially to decr...

Performance and energy efficiency of big data applications in cloud environments: A Hadoop case study

The exponential growth of scientific and business data has resulted in the evolution of cloud computing environments and the MapReduce parallel programming model. The focus of cloud computing is increased utilization and power savings through consolidation, while MapReduce enables large-scale data analysis. Hadoop, an open source implementation of MapReduce, has gained popularity in the last ...

Scheduling algorithm based on prefetching in MapReduce clusters

Due to cluster resource competition and task scheduling policy, some map tasks are assigned to nodes without input data, which causes significant data access delay. Data locality is becoming one of the most critical factors affecting the performance of MapReduce clusters. As machines in MapReduce clusters have large memory capacities, which are often underutilized, in-memory prefetching input data ...

Publication date: 2017